
where $\mathrm{GIoU}(\cdot)$ is the generalized intersection over union function [202]. Each $G_i$ reflects the "closeness" of student proposals to the $i$-th ground-truth object. Then, we retain highly qualified student proposals around at least one ground truth to benefit object recognition [235] as:

$$
\tilde{b}^S_j =
\begin{cases}
b^S_j, & \exists\, i:\ \mathrm{GIoU}(b^{GT}_i, b^S_j) > \tau G_i, \\
\varnothing, & \text{otherwise},
\end{cases}
\tag{2.34}
$$

where $\tau$ is a threshold controlling the proportion of distilled queries. After removing object-empty ($\varnothing$) queries, we form a distillation-desired query set of the student, denoted as $\tilde{q}^S$, associated with its object set $\tilde{y}^S = \{\tilde{c}^S_j, \tilde{b}^S_j\}_{j=1}^{\tilde{N}}$.
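For concreteness, this filtering step can be sketched in PyTorch. The function below is an illustrative reading of Eq. (2.34), not the reference implementation; the names `student_boxes`, `gt_boxes`, and `G` (the per-ground-truth scores $G_i$), as well as the default for `tau`, are assumptions, with boxes in $(x_1, y_1, x_2, y_2)$ format.

```python
import torch
from torchvision.ops import generalized_box_iou  # pairwise GIoU matrix


def filter_student_proposals(student_boxes, gt_boxes, G, tau=0.5):
    """Keep student proposals close to at least one ground truth (Eq. 2.34).

    student_boxes: (M, 4) student proposal boxes, (x1, y1, x2, y2)
    gt_boxes:      (N, 4) ground-truth boxes
    G:             (N,) per-ground-truth closeness scores G_i
    tau:           threshold controlling the proportion of distilled queries
                   (default value here is illustrative, not from the source)
    Returns the indices of the retained, distillation-desired proposals.
    """
    giou = generalized_box_iou(gt_boxes, student_boxes)  # (N, M)
    # A proposal b_j^S survives if GIoU(b_i^GT, b_j^S) > tau * G_i for some i.
    qualified = giou > tau * G[:, None]                  # (N, M) boolean
    keep = qualified.any(dim=0)                          # (M,) mask
    return keep.nonzero(as_tuple=True)[0]
```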

Correspondingly, we can obtain a teacher query set $\tilde{y}^T = \{\tilde{c}^T_j, \tilde{b}^T_j\}_{j=1}^{\tilde{N}}$. For the $j$-th student query, its corresponding teacher query is matched as:

$$
\tilde{c}^T_j, \tilde{b}^T_j \;=\; \arg\max_{\{\tilde{c}^T_k,\, \tilde{b}^T_k\}_{k=1}^{N}} \;\mu_1\, \mathrm{GIoU}(\tilde{b}^S_j, b^T_k) \;-\; \mu_2\, \big\| \tilde{b}^S_j - b^T_k \big\|_1,
\tag{2.35}
$$

where $\mu_1 = 2$ and $\mu_2 = 5$ control the matching function; their values follow [31].
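A minimal sketch of this matching step follows, assuming the retained student boxes $\tilde{b}^S$ and the teacher boxes $b^T$ are available as `(x1, y1, x2, y2)` tensors (the names are illustrative). `torch.cdist` with `p=1` gives the pairwise $\ell_1$ box distances, so each row's argmax realizes Eq. (2.35) per student query.

```python
import torch
from torchvision.ops import generalized_box_iou


def match_teacher_queries(student_boxes, teacher_boxes, mu1=2.0, mu2=5.0):
    """For each retained student box, pick the teacher query maximizing
    mu1 * GIoU(b_j^S, b_k^T) - mu2 * ||b_j^S - b_k^T||_1   (Eq. 2.35).

    student_boxes: (Ñ, 4), teacher_boxes: (N, 4)
    Returns a (Ñ,) tensor of matched teacher query indices.
    """
    giou = generalized_box_iou(student_boxes, teacher_boxes)  # (Ñ, N)
    l1 = torch.cdist(student_boxes, teacher_boxes, p=1)       # (Ñ, N)
    score = mu1 * giou - mu2 * l1
    return score.argmax(dim=1)  # index k* of the matched teacher query
```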

Finally, the upper-level optimization after rectification in Eq. (2.29) becomes:

$$
\min_{\theta} \; H(\tilde{q}^S \,|\, \tilde{q}^T).
\tag{2.36}
$$

Optimizing Eq. (2.36) directly is challenging. Alternatively, we minimize the norm distance between $\tilde{q}^S$ and $\tilde{q}^T$, whose optimum, i.e., $\tilde{q}^S = \tilde{q}^T$, is exactly the same as that of Eq. (2.36). Thus, our distribution rectification distillation (DRD) loss becomes:

$$
L_{DRD}(\tilde{q}^S, \tilde{q}^T) = \mathbb{E}\big[ \big\| \tilde{D}^S - \tilde{D}^T \big\|_2 \big],
\tag{2.37}
$$

where we use the Euclidean distance between the co-attended features $\tilde{D}$ (see Eq. (2.26)), which contain the information of the queries $\tilde{q}$, for optimization.
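In code, Eq. (2.37) reduces to a mean Euclidean distance between matched feature tensors. A sketch, assuming `D_s` and `D_t` hold the matched rows of $\tilde{D}^S$ and $\tilde{D}^T$ (names and shapes are assumptions), with the teacher detached so gradients reach only the student:

```python
import torch


def drd_loss(D_s, D_t):
    """Distribution rectification distillation loss (Eq. 2.37):
    expected Euclidean distance between co-attended features.

    D_s, D_t: (Ñ, d) matched student/teacher features from Eq. (2.26).
    """
    # Detach the teacher so it serves only as a fixed hint.
    return (D_s - D_t.detach()).norm(p=2, dim=-1).mean()
```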

During backward propagation, the gradient update drives the student queries toward their teacher hints, thereby accomplishing the distillation. The overall training loss for our Q-DETR model is:

$$
L = L_{GT}(y^{GT}, y^S) + \lambda\, L_{DRD}(\tilde{q}^S, \tilde{q}^T),
\tag{2.38}
$$

where $L_{GT}$ is the common detection loss for tasks such as proposal classification and coordinate regression [31], and $\lambda$ is a trade-off hyper-parameter.

2.4.5 Ablation Study

Datasets. We first conduct the ablative study and hyper-parameter selection on the PASCAL VOC dataset [62], which contains natural images from 20 different classes. We use the VOC trainval2012 and VOC trainval2007 sets, together containing approximately 16k images, to train our model, and the VOC test2007 set, containing 4952 images, to evaluate our Q-DETR. We report COCO-style metrics for the VOC dataset: AP, AP50 (the default VOC metric), and AP75. We further conduct experiments on the COCO 2017 [145] object detection track. Specifically, we train the models on COCO train2017 and evaluate them on COCO val2017. We report the average precision (AP) over IoU thresholds [0.5:0.05:0.95], designated as AP, using COCO's standard evaluation metric. For further analysis of our method, we also report AP50, AP75, APs, APm, and APl.

Implementation Details. Our Q-DETR is trained within the DETR [31] and SMCA-DETR [70] frameworks. We select ResNet-50 [84] as the backbone and modify it with pre-activation structures and the RPReLU [158] function, following [155]. PyTorch [185] is used for implementing Q-DETR. We run the experiments on 8 NVIDIA Tesla A100 GPUs with 80 GB of memory.